Please document your answers to all homework questions using R Markdown, and submit your compiled .html output in a zipped folder (zipping is necessary when using plotly and leaflet).
\(~\)
Twitter is a popular social media network in which users can send and receive short messages (called “tweets”) on any topic they wish. For this question you will analyze the data contained in the file Ghostbusters.txt (which can be read using the code below). This file contains 5000 tweets downloaded from Twitter on July 18, 2016, based on a search for the word “ghostbusters”.
## Because the format of Twitter data is different from what we're used to,
## we'll need the "scan" function to read it into R. "scan" will search for particular
## characters and use them to define each element of the object it returns
data <- scan("https://raw.githubusercontent.com/ds4stats/case-studies/master/twitter-sentiment/Ghostbusters.txt", what = "")
For Part A, use the stringr package to write code that cleans these data by removing the Unicode values (strings like <U+00A0>). To do this, you should assume that anything appearing between the characters < and > (along with the brackets themselves) can be removed.
# Your code for A here
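As a hint for Part A, here is a minimal sketch of one possible pattern, run on a made-up vector rather than the real tweets. The object names `toy` and `cleaned` are hypothetical, and the pattern is only one of several that would work.

```r
library(stringr)

# Hypothetical toy strings, not taken from Ghostbusters.txt
toy <- c("Who<U+00A0>you gonna call?",
         "<U+201C>Ghostbusters<U+201D>",
         "no tags here")

# "<[^>]*>" matches "<", then any run of characters other than ">", then ">"
cleaned <- str_remove_all(toy, "<[^>]*>")
cleaned
```

Using `[^>]*` instead of `.*` keeps each match from running past the first closing bracket when a string contains several Unicode values.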
On Twitter, a user may echo another user’s tweet to share it with their own followers by “retweeting”. In these data, all retweets begin with the letters “RT” followed by “@” and the original user’s Twitter name. For Part B, write code that stores the retweets in a separate dataset, then use the length function to find the number of tweets in this dataset.
# Your code for B here
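As a hint for Part B, the sketch below shows one way to subset strings that begin with a given prefix, again on a made-up vector. The names `toy` and `retweets` are hypothetical, and the exact spacing after “RT” is an assumption you should check against the real data.

```r
library(stringr)

# Hypothetical toy tweets, not taken from Ghostbusters.txt
toy <- c("RT @fan123: Loved it!",
         "Saw it last night",
         "RT @critic: meh",
         "ART @ the museum")

# "^" anchors the pattern to the start of the string
retweets <- toy[str_detect(toy, "^RT @")]
length(retweets)
```

The anchor matters: without `^`, the last toy string would also be picked up because it contains “RT @” in the middle.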
For Part C, after excluding retweets, find the number of tweets in which “hate” or “hated” (with any capitalization) appears, and the number of tweets in which “love”, “loved”, or “looved” (and all variants with more “o”s or other capitalization) appears. Hint: the sum() function can be used to count the number of TRUE elements in a logical vector, which can be used in conjunction with str_detect() to answer this question. You might also find logical negation, achieved using the ! character, to be helpful in creating a subset of non-retweets. We’ve seen this before in class with the command !is.na(...) being used to select cases without missing values.
# Your code for C here
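As a hint for Part C, the sketch below combines negation, case-insensitive matching, and sum() on a made-up vector. The object names are hypothetical. Requiring word boundaries (`\\b`) is a design choice, not something the question demands: it prevents “gloves” from counting as “love”, but you may decide a looser match is more appropriate.

```r
library(stringr)

# Hypothetical toy tweets, not taken from Ghostbusters.txt
toy <- c("I LOVED it", "looooved the cast", "Hate the remake",
         "RT @a: love it", "whatever", "gloves off")

# Keep only strings that do NOT start with "RT @"
non_rt <- toy[!str_detect(toy, "^RT @")]

# "lo+ved?" allows one or more "o"s and an optional trailing "d"
love_pat <- regex("\\blo+ved?\\b", ignore_case = TRUE)
hate_pat <- regex("\\bhated?\\b", ignore_case = TRUE)

# sum() counts the TRUE values returned by str_detect()
n_love <- sum(str_detect(non_rt, love_pat))
n_hate <- sum(str_detect(non_rt, hate_pat))
c(love = n_love, hate = n_hate)
```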
Side Remark: To download tweets from Twitter, you need to have a Twitter account and then sign in to the developer page. Analyzing Twitter data makes for a potentially interesting project. Details on the authentication procedure can be found at this link: http://thinktostart.com/twitter-authentification-with-r/
\(~\)
The Happy Planet Index is an attempt to measure how well different world nations are doing at achieving long, happy, and sustainable lives for their citizens using data compiled from various sources. A description of the dataset’s variables can be found on slide 11 here
For this question, use the plot_ly function in the plotly package to construct a 3-D scatter plot of “Happiness” versus “LifeExpectancy” and “GDPperCapita” with a fitted linear regression plane (found using lm()) depicting the model Happiness ~ LifeExpectancy + GDPperCapita. Your graph should include hoverable labels displaying the country represented by each data point. You may use the argument hoverinfo = "text" so that these labels only provide the text label you specify (and not the x, y, z coordinates of the point).
HappyPlanet <- read.csv("https://remiller1450.github.io/data/HappyPlanet.csv")
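The general recipe (fit the model, predict over a grid, draw the predictions as a surface) can be sketched on simulated data as follows. All names here (`toy`, `x1`, `x2`, `y`, `label`, `z_mat`) are hypothetical and the data are simulated, so treat this as an illustration of the approach, not a solution for the HappyPlanet data.

```r
library(plotly)

# Simulated stand-in data (not the HappyPlanet data)
set.seed(1)
toy <- data.frame(x1 = runif(30, 50, 80), x2 = runif(30, 1000, 40000))
toy$y <- 1 + 0.08 * toy$x1 + 0.00003 * toy$x2 + rnorm(30, sd = 0.3)
toy$label <- paste("Country", seq_len(30))

fit <- lm(y ~ x1 + x2, data = toy)

# Grid of predictor values spanning the observed ranges
x1_seq <- seq(min(toy$x1), max(toy$x1), length.out = 25)
x2_seq <- seq(min(toy$x2), max(toy$x2), length.out = 25)
grid <- expand.grid(x1 = x1_seq, x2 = x2_seq)

# expand.grid varies x1 fastest, so rows of z_mat index x1 and columns index x2
z_mat <- matrix(predict(fit, newdata = grid),
                nrow = length(x1_seq), ncol = length(x2_seq))

# A plotly surface expects z[i, j] at (x[j], y[i]), hence the transpose
fig <- plot_ly() %>%
  add_trace(data = toy, x = ~x1, y = ~x2, z = ~y,
            type = "scatter3d", mode = "markers",
            text = ~label, hoverinfo = "text") %>%
  add_surface(x = x1_seq, y = x2_seq, z = t(z_mat),
              opacity = 0.6, showscale = FALSE)
fig
```

The transpose is the step most easily missed: without it the plane is drawn with its two slopes swapped, which can be hard to spot when the predictors are on similar scales.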
Your final result should look something like the graphic shown below (it does not need to resemble it exactly, but it should be similar):
Note: colorscale = "RdBu" was used above.
\(~\)
A precinct is the smallest geographic unit for which aggregated voting data is publicly available. For this question, you will work with precinct-level election data for the state of Iowa from the 2020 US presidential election. These data were acquired through Harvard’s Dataverse, and were published by the Voting and Election Science Team in 2020 (https://doi.org/10.7910/DVN/K7760H).
To begin, you should download this zipped folder and extract it to an accessible location on your PC. The code below reads these data on my PC (note that I have the iowa_precincts folder in my Downloads folder).
library(leaflet)
library(maptools)
iowa <- readShapeSpatial("C:/Users/millerry/Downloads/iowa_precincts/ia_2020") ## You will need to change this file path
In Part A your goal is to create a ggplot map where each precinct is colored according to the margin by which it favored Donald Trump (votes recorded in “G20PRERTRU”) or Joe Biden (votes recorded in “G20PREDBID”). The code below creates a new variable, “MARGIN”, that you can use for this purpose.
## Relative difference in Trump vs. Biden votes
iowa@data$MARGIN <- (iowa@data$G20PRERTRU - iowa@data$G20PREDBID) /
  (iowa@data$G20PRERTRU + iowa@data$G20PREDBID +
     iowa@data$G20PRELJOR + iowa@data$G20PREGHAW)
Your final result should look something like the map below (it does not need to resemble it exactly, but it should be similar):
\(~\)
In Part B your goal is to make an interactive leaflet map of Poweshiek County (home to Grinnell) and Jasper County (just west of Grinnell) showing each precinct’s name and vote totals for Donald Trump and Joe Biden when you hover over it. The map should also include a highlight that clearly displays which precinct the user is hovering over.
You may use the following code to subset the spatial polygons file to include only Poweshiek and Jasper Counties.
## Create subset (note that pow_co is a spatial polygons file, just like "iowa")
pow_co <- iowa[iowa$COUNTY %in% c("Poweshiek", "Jasper"), ]
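The hover labels and hover highlight can be sketched on two made-up square “precincts” as shown below. The coordinates, label text, and vote counts are all invented for illustration. With the real data you would instead pass your spatial polygons object to leaflet() and build the labels from its columns.

```r
library(leaflet)

# Two hypothetical square polygons; an NA separates one polygon from the next
lng <- c(-92.8, -92.7, -92.7, -92.8, NA, -92.7, -92.6, -92.6, -92.7)
lat <- c(41.7, 41.7, 41.8, 41.8, NA, 41.7, 41.7, 41.8, 41.8)

# One label per polygon (invented names and vote totals)
labels <- c("Precinct A: Trump 120, Biden 95",
            "Precinct B: Trump 80, Biden 130")

m <- leaflet() %>%
  addTiles() %>%
  addPolygons(lng = lng, lat = lat, label = labels,
              weight = 1, color = "grey40",
              # Style applied only while the cursor is over a polygon
              highlightOptions = highlightOptions(weight = 4,
                                                  color = "black",
                                                  bringToFront = TRUE))
m
```

Setting `bringToFront = TRUE` keeps the thicker highlighted border from being hidden behind neighboring polygons.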
Your final result should look something like the map below (it doesn’t need to be exactly identical):